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ABSTRACT 

Next-generation sequencing (NGS) technologies have raised a chal- 
lenging de novo genome assembly problem that is further amplified in 
recently emerged single-cell sequencing projects. While various NGS 
assemblers can use information from several libraries of read-pairs, 
most of them were originally developed for a single library and do 
not fully benefit from multiple libraries. Moreover, most assemblers 
assume uniform read coverage, condition that does not hold for 
single-cell projects where utilization of read-pairs is even more chal- 
lenging. We have developed an exSPAnder algorithm that accurately 
resolves repeats in the case of both single and multiple libraries of 
read-pairs in both standard and single-cell assembly projects. 
Availability and implementation: http://bioinf.spbau.ru/en/spades 
Contact: ap@bioinf.spbau.ru 

1 INTRODUCTION 

Most of existing next-generation sequencing (NGS) platforms 
generate read-pairs — pairs of reads (called mates) that are 
sequenced from different ends of a genomic fragment with ap- 
proximately known length (called the insert size). Because the 
insert size usually exceeds the length of a single read, read- 
pairs may match up unique regions surrounding repeats that 
are longer than the read length. A combination of several 
libraries of read-pairs with different insert sizes is often used to 
produce high-quality assemblies (Butler et al. 2008; Bresler et al., 
2012). Paired-end libraries usually have insert size <1 kb and are 
used for resolving relatively short repeats. Jumping libraries are 
characterized by an average insert size of anywhere from 1 to 20 
kb and are helpful in resolving longer repeats and contig scaf- 
folding (inferring the order of contigs in the genome). However, 
because of a high insert size variation, information from jumping 
libraries is rather difficult to use for the purpose of assembly. 

The problem of using multiple read-pair libraries was previ- 
ously addressed by ALLPATHS-LG (Gnerre et al, 2011), Ray 
(Boisvert et al, 2010), Velvet (Zerbino and Birney, 2008) and 
some other assemblers. However, these tools are designed for 
standard (mutlicell) assemblies and do not perform well on 
single-cell datasets. On the other hand, the single-cell assemblers 
ESC (Chitsaz et al, 2011), IDBA-UD (Peng et al, 2012) and 
SPAdes (Bankevich et al, 2012) are designed for a single 
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read-pair library. In addition, the recently proposed Paired de 
Bruijn Graph algorithms for repeat resolution (Medvedev et al, 
2011; Pham et al, 2013; Vyahhi et al, 2012) also focus on 
a single library and it remains unclear how to extend them to 
multiple libraries. 

We present exSPAnder algorithm that works with both single 
and multiple libraries in standard and single-cell assembly pro- 
jects. exSPAnder uses a simple path extension approach for 
repeat resolution that was originally proposed in the Ray assem- 
bler [and later used in Telescoper (Bresler et al, 2012)] and com- 
bines it with some ideas from the Rectangle Graph approach 
(Bankevich et al, 2012; Vyahhi et al, 2012). Given a set of 
paths in the assembly graph (Bankevich et al, 2012) (i.e. simph- 
fied de Bruijn graph (Compeau et al, 2011; Pevzner et al, 2001) 
of ^-mers in reads after removal of bulges, tips and chimeric 
edges), exSPAnder attempts to extend each path with the goal 
to generate longer paths. For a path P ending in a vertex v, we 
consider all edges starting at v (referred to as extension edges) and 
compute Score p(e) for each extension edge. To compute 
Score p(e) we analyze all reads that map to path P and whose 
mates map to e. Thus, Score p(e) reflects our confidence that an 
extension of the path P by the edge e is correct. We note that to 
properly map read-pairs and calculate Score p{e), the total length 
of path P and edge e should be longer than the insert size. 

In addition to function Scorep(e), exSPAnder uses a decision 
rule Extend(P) that either chooses one of the extension edges to 
extend the path P or makes the decision to stop growing this 
path beyond the ending vertex of P. The procedure is iterated 
over all the paths until no path can be further extended. To 
initiate this algorithm one can start with a set of single-edge 
paths formed by all sufficiently long edges in the assembly 
graph. The resulting paths are output as contigs after removing 
the paths that are contained within other paths as well as remov- 
ing non-informative overlaps (i.e. suffixes of paths that represent 
prefixes of other paths). 

This simple approach is merely a framework and, depending 
on the specifics of the scoring function and the decision rule, it 
can be either efficient (hke in the Ray assembler) or disastrous. 
The authors of Telescoper made an attempt to improve on Ray's 
scoring function and to substantiate it with rigorous statistical 
analysis. However, scoring functions in both Ray and Telescoper 
are not universal, e.g. they assume the uniform genome coverage 
by reads, condition that does not hold for single-cell data. 

We demonstrate that exSPAnder works well on single-cell 
datasets with multiple libraries. We also show that exSPAnder 
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Table 1. Information on the B.faecium isolate dataset and the S.aureus 
single-cell dataset 
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Fig. 1. Plots of the insert size distributions for B.faecium isolate 
(a) paired-end and (b) jumping library, and S.aureus single-cell dataset 
with (c) paired-end and (d) jumping library. The distributions were com- 
puted by mapping reads to the B.faecium str. DSM4810 (Lapidus et al, 
2009) and S. aureus str. USA300 substr. FPR3757 (Diep et al, 2006) 
reference genomes, respectively. All plots are in the logarithmic scale 

[implemented as a part of SPAdes (Bankevich et al, 2012)] im- 
proves on existing assemblers on standard bacterial datasets and 
outperforms such popular assemblers as ABySS (Simpson et al, 
2009), Ray (Boisvert et al, 2010), SOAPdenovo (Li et al, 2010) 
and Velvet (Zerbino and Birney, 2008). ALLPATHS-LG 
(Gnerre et al, 201 1) is an excellent assembler whose applications, 
however, are limited to specially constructed read-pair libraries. 
On such libraries exSPAnder and ALLPATHS-LG generate 
comparable results (exSPAnder generates longer contigs but 
ALLPATHS-LG generates longer scaffolds). 



2 ANALYSIS OF READ-PAIR LIBRARIES 

As we mentioned in the introduction, jumping libraries present 
additional challenges for genome assembly owing to their high 
variations in the insert size. Additionally, jumping libraries have 
high rate of chimeric read-pairs — read-pairs that either have ab- 
normal insert size or incorrect orientation. Chimeric read-pairs 
further complicate utilization of such libraries while resolving 
repeats. 

Below we present analysis of the insert size distributions for 
the following data: Brachybacterium faecium isolate dataset (read 
length 1 50 bp) and Staphylococcus aureus single-cell dataset (read 
length 101 bp). Both datasets contain one paired-end and one 
jumping library. To analyze read-pair libraries we aligned reads 
to the B.faecium and S.aureus reference genomes using Bowtie 2 
(Langmead and Salzberg, 2012) and computed the chimeric read- 
pair rates and insert size distributions. 

As Figure 1 and Table 1 demonstrate, both isolate and single- 
cell paired-end libraries have small variations of the insert size. In 
contrast, the jumping library from the B.faecium dataset has a 
much higher variation in the insert size and higher rate of chi- 
meric read -pairs (Table 1). At the same time, in addition to the 
large insert size variations, S.aureus jumping library contains 
only 22% of all read-pairs aligned with the proper reverse- 
forward orientation. Thirty-six percent of all read-pairs have 



Dataset 


B.faecium 




S.aureus 




Library 


Paired-end 


Jumping 


Paired-end 


Jumping 


Number of reads 


13 M 


41 M 


38 M 


41 M 


Average coverage 


400 X 


llOOx 


1050x 


1050x 


Coverage span 


210-570X 


0-3000X 


0-3500X 


0-3500X 


Insert size 


270 bp 


7.5 kb 


210bp 


1.8 kb 


Insert span 


150-400 bp 


6-10 kb 


180-230 bp 


0.5-4 kb 


Chimeric read-pairs (%) 


1 


9 


3 


50 


Unaligned read-pairs (%) 


16 


10 


6 


28 



Note: Insert span is the shortest insert size interval that contains at least 95% of 
properly aligned read-pairs. Unaligned reads refer to the percentage of read-pairs 
that have at least one read unaligned. Chimeric read-pairs refer to the percentage of 
chimeric read-pairs among all read-pairs. All statistics was obtained using Bowtie 2 
(Langmead and Salzberg, 2012). Coverage span is the smallest coverage interval that 
includes a least 95% of all genomic positions 



incorrect forward-reverse orientation (correspond to the left 
peak in Fig. Id) and 14% are classified as chimeric read-pairs 
of other types. Forward-reverse read-pairs in jumping libraries 
represent an artifact of the sample preparation and are common 
for datasets of different types. 

Despite the fact that various artifacts of jumping libraries 
make it difficult to incorporate them into existing assembly 
tools, exSPAnder uses jumping libraries to generate high-quahty 
assembhes. 



3 EXSPANDER ALGORITHM 

exSPAnder uses an assembly graph constructed by SPAdes 
(Bankevich et al, 2012) and a set of read-pair libraries. For 
each library, we map read-pairs to the long edges of the assembly 
graph and estimate the average insert size along with its confi- 
dence interval — a shortest insert size interval that contains at least 
80% of properly aligned read-pairs. These estimates are used as 
parameters of the scoring function and the decision rule. 

3.1 The decision rule 

3.1.1 Single library Given a path P, we define a winner as an 
edge e with the maximal score Score p{e) among all extension 
edges for P. Similarly, a contender is defined as an extension edge 
with the second best score. The winner edge is called the strong 
winner if (i) Score p{winner) >0 and (ii) Score p(w inner) > C • 
Score p{contender), where 0 and C are parameters of the algo- 
rithm, which are discussed below. If the path P has a single 
extension edge (which is obviously the winner), only the first 
condition is used. The decision rule is defined as follows: 



Extend{P) = 



e, if edge e is the strong winner for path P 
0, if there is no strong winner for path P 



3.1.2 Multiple libraries The decision rule described above can 
be generalized for several read-pair libraries. Consider M read- 
pair libraries, which are sorted in the order of increasing insert 
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Fig. 2. (a) Reads r and / form a read-pair mapping to consecutive edges e 
and e' in the assembly graph at positions xq and respectively. 

(b) Representation of a read-pair (r/) as a point in a rectangle (e,^')- 

(c) 'Ideal read-pairs' with the exact insert size / connecting edges e and e' 
form a 45° line within a rectangle, (d) Read-pairs from the real sequencing 
data with variations in the insert size represented as points within a rect- 
angle. Most points are located within the confidence strip providing the 
evidence that edges e and are supported by the read-pairs and are 
genome-consecutive, (e) A rectangle formed by a pair of edges that has 
few points falling into the confidence strip revealing that e and e' are not 
genome-consecutive edges 

sizes and the associated decision rules Extendi{P) for \ <i<M. 
We process the libraries in this order because our analysis re- 
vealed that the smaller is the insert size of a library (and its 
variation), the more rehable is the decision rule for this individual 
library. We thus select the library with the smallest index / that 
has the strong winner and defme the decision rule for multiple 
libraries Extend{P) as simply Extendi{P). If neither library has a 
strong winner, we define Extend{P) = 0. 



3.2 The scoring function 

3.2.1 The support function We first consider an idealized case 
when the genome defines a genomic path in the assembly graph. 
We say that an edge e' follows edge e at a distance D if the 
distance between starts of these edges in the genomic path is 
D. We define a boolean function Support ^{e^e') that reflects 
our confidence that edge e' follows edge e in the genome at dis- 
tance D. Below we describe how to calculate Supportn(e,e'). 

Let / and [IminJmax] be the mean and the confidence interval of 
the insert size for a read-pair library formed by reads of length 
ReadLength. Consider consecutive edges e and e' in the assembly 
graph and a read-pair (r,/) such that read r maps to e at position 
Xq and read / maps to e' in position (Fig. 2a). We say that the 
read -pair {r/') connects edges e and e' . Figure 2b shows a rect- 
angle formed by the edges e and e' [further simply referred to as 
rectangle {e.e')] with the read-pair {r/) represented as a point 
(xo,yo) within the rectangle. If edges e and e' represent consecu- 
tive regions in the genome, then the genomic distance from the 
start of read r to the start of read / equals to Length(e) — xo+yo, 



where Length(e) stands for the length of edge e. Therefore, in the 
case of an 'ideal read-pair' (r/) (e.g. a read-pair with the exact 
insert size I), yo = xq — d, where d= Length(e) — 1+ ReadLength. 
Thus all 'ideal read-pairs' mapping to edges e and e' form a set 
of integer points on the 45° line y = x- d within the rectangle 
(Fig. 2c). Because the read-pairs from the real sequencing data 
have variations in the insert size, their corresponding points are 
typically scattered in the strip between the 45° lines y = x - dmin 
and y = X - d^ax, where 

dfnin = Length(e) — Imin + ReadLength, 

dmax = Length(e) — Imax + ReadLength. 

This strip in the rectangle is further referred to as the confidence 
strip (Fig. 2d). 

Let F{x) be the empirical distribution of the insert size and S 
be a set of all integer points within the confidence strip in the 
rectangle {e,e'). We define the expected number of read-pairs 
within the confidence strip (under the assumption of the uniform 
coverage) as 

Expected(e, e') = ^ F{I{x, y)), 

where I{x, y) = Length(e) — x + y + ReadLength represents the 
insert size of a read-pair that corresponds to the point (x, y). 
We also define Points(e,e') as the total number of read-pairs 
(from the real dataset) that correspond to the points within the 
confidence strip. The notion of density is defined as 



Density(e, e') = 



Points(e, e') 
Expected(e, e') ' 



We set Density(e,e') = 0, if Expected(e,e') = 0. The points outside 
the confidence strip may represent read-pairs with somewhat 
larger deviations from the mean insert sizes or chimeric read- 
pairs. Our analysis revealed that being conservative (e.g. limiting 
analysis to the confidence strip) allows one to avoid most of the 
assembly errors caused by chimeric read-pairs, particularly in 
single-cell projects. 

We distinguish between notions of genome-consecutive and 
graph-consecutive edges and emphasize that graph-consecutive 
edges are not necessarily genome-consecutive. The decision 
about which graph-consecutive edges are genome-consecutive is 
an important part of any assembler. Figure 2d and e illus- 
trate how rectangles help us to make such decisions: both rect- 
angles correspond to graph-consecutive edges, but only rectangle 
in Figure 2d is formed by the pair of genome-consecutive edges. 

The described notions of Expected(e,e'), Points(e,e') and 
Density(e,e') (defined for the case when edges e and e' are 
genome-consecutive) can be generalized for the case when e 
and e' are not consecutive genomic edges under the assumption 
that genomic distance between them is D. In this case the confi- 
dence strip [further referred to as Stripj)(e,e')] is bounded by the 
lines y = X - dfnin and y = x - 



-D 



- ReadLength, 



Point Sj)(e,e') similarly represents the number of points within 
strip Strip r){e,e'). Expectedn(e,e') and Density n(e,e') are defined 
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Fig. 3. (a) An example of an assembly graph with the genomic paths (pi, 
Pi. Pz. ^i) and {p\,p'2,p^, 62). (b, e) The composite rectangles for correct 
genomic extension of each path: in these cases the points are evenly 
distributed within the confidence strip and the resulting score is equal 
to 1. (c, d) The composite rectangles that correspond to incorrect exten- 
sions edges of these two paths. In each of these cases, at least one simple 
rectangle contains few points within the confidence strip 



using the same formulas and the corresponding confidence strip. 
Clearly, for two genome-consecutive edges Expected(e,e') = 
Expected]^(e,e') and Density{e,e') = Density iie,e'), where 
L = Length{e). 

The support function reflects whether the number of read- 
pairs connecting edges e and e' supports the conjecture that e' 
follows e in the genome at distance D\ 



Support j)(e, e') = 



1, Density D{e, e')>^ 
0, otherwise 



where ^ is a parameter of the algorithm, which is automatically 
computed for each read-pair library based on the chimeric read- 
pair rate (see below). For the standard isolate datasets this par- 
ameter corresponds to the coverage cutoff for read-pairs. For 
single-cell datasets this parameter is usually set to be very low 
to retain the regions with low coverage, which are typical for sin- 
gle-cell projects. If Supportj)(e,e') = 1, we say that the rectangle 
(e,e') is supported by the read-pairs. 

3.2.2 The naive scoring function To explain the intuition behind 
exSPAnder, we first introduce the naive scoring function. We 
further modify the naive scoring function to arrive to the 
advanced scoring function used in the real exSPAnder 
implementation. 

A path P = (pi,. . . ,p„^ and its extension edge e can be repre- 
sented as a composite rectangle formed by m simple rectangles 
(Pj,e) containing points that correspond to read-pairs connecting 
edges of P and e. Figure 3b shows an example of a composite 
rectangle, which is formed by a path (pi,p2, Ps) and its extension 
edge ei and consists of three simple rectangles. The notion of the 



confidence strip remains (it now consists of up to m substrips 
within simple rectangles), except that it is bounded by the lines 



y ■ 



I and y 

dmin 



X - 



Length(P) — Imin + ReadLength, 

dmax = Length(P) — I max + ReadLength. 

For an edge pj from the path P we define the expected number 
of points in the confidence substrip within the simple rectangle 
ippe) as Expectedj)j(pj, e), where Dj is the distance between 
start of pj and start of e according to the path P 
(i.e. Dj = Lengthipi)). We consider rectangles (pj, e),j = star 
t, . . . , end such that Expectedj).(pj, e)>0 and introduce the func- 
tion Score p(e) as the fraction of the total number of expected 
read-pairs in these rectangles with Supportojipj, e)=\: 



Score p(e) = 



SupportDjiPj, e) ' Expectedojipj, e) 

start 

end 

Expectedojipj, e) 



We set Score p{e) = 0 if all simple rectangle have zero expected 
read-pairs. 

Figure 3a shows paths P = (pi, p2, Ps) and P =(pi,P 2^P^) 
and its extension edges ^1,^2- Let (pi, p2, Ps, ^i) and (pi,p2,p3, 
e2) be the true (but unknown) genomic paths. Figure 3b shows 
the composite rectangle for path P and its correct extension e\, in 
which points within the confidence strip are rather evenly dis- 
tributed resulting in Scorep{e\) = 1. Figure 3c shows the com- 
posite rectangle for path P = (pi, p2, Ps) and its incorrect 
extension edge ^2. Because (pi,P2,P3, ^2) is a genomic path, dens- 
ity of the points in the sectors of the confidence strip correspond- 
ing to edges pi and ps is high. However, edge p2 of the path P 
does not support extension edge ^2 because there are few points 
in the rectangle (p2, ^2)- Additionally, Figure 3d and e shows 
composite rectangles for all possible extension edges for path 

R' = (P\,P'2,P3)- 

Because the defined scoring function does not linearly depend 
on read coverage, it is well suitable for both single-cell and stand- 
ard sequencing projects. At the same time, considering only read- 
pairs with insert size in [IminJmax] (which correspond to points 
within the confidence strip) allows one to filter out most of the 
chimeric read-pairs (common for single-cell datasets) and to min- 
imize their influence on the scoring function. 

3.2.3 The advanced scoring function The naive scoring function 
Score p(e) described above works well in many cases but may be 
too conservative when the path P contains repetitive edges (edges 
that are visited more than once by the genomic traversal). 
Figure 4 illustrates the case when the path P has a repetitive 
edge and motivates the need for further improvements in the 
scoring function. 

Figure 4a shows an assembly graph with four unique edges 
(p,p\ e and e') and a single repetitive edge Pr with multiplicity 2. 
We assume that paths (p,Pr,e) and {p' ,pr,e') are genomic and 
paths (p,pr,e') and (p',pr,e) are non-genomic. Our goal is to 
design an algorithm that correctly extends genomic paths (p,Pr) 
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Fig. 4. Scoring a path that contains repetitive edges, (a) An example of 
the assembly graph with a repetitive edge pr. (b) A composite rectangle for 
the correct extension e of path {p,Pr). (c) A composite rectangle for the 
incorrect extension e' of the path {p,p,) 



and ip' ,Pr) into longer genomic paths {p,Pr,e) and {p' ,Pr,e'), 
respectively. 

Consider a path P = (p,Pr) and composite rectangles (P,e) and 
(P,e') (Fig. 4b and c). As Figure 4 illustrates, Score p{e) is similar 
to Score p(e'), implying that there is no strong winner for path P 
and preventing us from extending the path P by edge e. 
However, because the repetitive edge Pr supports both extension 
edges e and e\ it does not provide any valuable information 
about the correct extension of the path P. Therefore, to make 
a decision about extending the path P by an extension edge e, we 
should have excluded p^ from the consideration as a repetitive 
edge. Because we do not know in advance which edges of the 
assembly graph correspond to repeats in the genome, we classify 
Pr as repetitive because it supports both extension edges e and e'. 

Below we present the exSPAnder algorithm that allows us 
to exclude repetitive edges from contributing to scoring. An 
extension edge e of path P is called an active edge if 
C • Score p(e) > Score p{winner). At the first step of the algorithm 
we score all extension edges and form a set of active edges A. An 
edge pj in path P is classified as repetitive if it supports all active 
edges, i.e. Supportojipj, e/)= 1 for all ei e A. At the second step 
we mark all repetitive edges pj e P and recalculate scores of all 
edges in A ignoring these repetitive edges. We then update A by 
removing all non-active edges and iterate the process. The pro- 
cess continues until yet another iteration does not change the set 
of active edges A. If A={e} and Score p(e) >0 (which means that 
e satisfies both conditions in the decision rule) the extension edge 
e is considered to be a strong winner and added to the path P. 
Otherwise, we stop extending path P. 

We further demonstrate the work of the exSPAnder algo- 
rithm using a simple assembly graph shown in Figure 5. The 
paths (p\,p2,P3,ei), (p[,p2,p3,e2) and (p'^^p^^e^) are genomic 
paths, which means that edges p2 and p^ are repetitive and have 
multiplicities 2 and 3, respectively. 

Let P = (pi, p2, Ps) be a path we aim to extend. We first cal- 
culate scores of all extension edges using the composite rectangles 
(Fig. 6a-c) and form a set of active edges ^={^1,^2, ^3} 
based on their scores (marked red in Fig. 6d). Because Supportn^ 




Fig. 5. An example of the assembly graph with repetitive edges p2 and p^ 



(P3, ei) = 1 for / = 1, 2, 3, edge />3 is classified as repetitive and is 
removed from further consideration (Fig. 6e). We now recalcu- 
late scores for the extension edges in A ignoring repetitive edge p^ 
(Fig. 6f-h) and remove non-active edge ^3 from A (Fig. 6i). 
Using the updated set A={e\,e2} we again proceed to the 
repeat detection step and mark edge p2 as repetitive because Su 
pportD2(P2, ei) = \ for /= 1, 2 (Fig. 6j). Finally, we once again 
recalculate scores of the extension edges in A (Fig. 6k-m) and 
remove ^2 as non-active (Fig. 6n). The extension edge Ci remains 
the only active edge in and is used to extend path P. 

Extensive tests of the advanced scoring function revealed that 
it works well across diverse datasets including single-cell jumping 
libraries with high variations in the insert size, extremely non- 
uniform coverage and large number of chimeric reads and chi- 
meric read-pairs (see Section 4). 

3.3 Scaffolding 

After all paths are constructed, we consider all pairs of paths 
that form composite rectangles with non-zero number of points 
(Fig. 7). For each such pair of paths P and P' we can check 
whether points in the corresponding composite rectangle are 
scattered around a certain 45° line using SPAdes distance esti- 
mation procedure (Nurk et aL, 2013). When SPAdes provides the 
estimated distance D between P and P', we use exSPAnder to 
verify the conjecture that P' follows P at distance D. If this con- 
jecture is supported and does not contradict to any other con- 
jectures about these paths, we extend the path P by P' (the 
scaffolding step). We estimate the gap length between the paths 
as Z) — Length{P) and insert the appropriate number of 'N' sym- 
bols (unspecified nucleotide) between end of P and start of P' . If 
paths P and P' overlap, we construct their overlap alignment to 
correct distance D. 

3.4 Choice of the parameters 

3.4.1 The scoring function We select the parameter ^ as a 
threshold for the density of the read-pairs within the confidence 
strip. We therefore assume that rectangles with the density below 
^ contain mostly chimeric read-pairs and should be ignored 
while calculating the score of an extension edge. To select ^ 
for a particular read-pair library, we estimate the distribution 
of the densities for rectangles that contain only chimeric read- 
pairs (false rectangles) and for rectangles that contain only non- 
chimeric read-pairs (true rectangles). 

To partition all read-pairs into chimeric and non-chimeric, one 
needs the complete genome that is unavailable. To get around 
this, we identify a subset of chimeric reads using the long edges in 
the assembly graph (e.g. edges longer than N50) that can be 
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Fig. 6. A step-by-step example of the exSPAnder algorithm, (a-c) Forming a set of active edges {ci, 62, e^} (marked red) for the path P = (pi, p2, P3) 
using the corresponding composite rectangles, (d, e) Classifying of edge pi, as repetitive and removing it from further consideration (marking gray). Edges 
that are not classified as repetitive are colored in blue, (f-h) Recalculating scores of the extension edges and updating the set of active edges, (i, j) 
Removing repetitive edge p2- (k-m) Recalculating scores for the remaining active edges {ei, €2} and removing €2 as non-active, (n) Selecting the only 
active edge ei as an extension for the path P 
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Fig. 7. An example of a composite rectangle formed by paths (pi, p2, P3) 
and (p\,P2,P3) 



viewed as subgenomes of the complete genome. By mapping a 
read-pair to a long edge we can compute its insert size and thus 
classify read-pairs with abnormally large insert sizes as chimeric. 

To estimate the parameter ^, we generate a large number of 
equally sized rectangles by artificially 'splitting' each long edge 
into shorter edges of equal length (e.g. 100 bp) called uni-edges. 
We then consider rectangles formed by all pairs of uni-edges 
within the same long edge. Because we know the exact distance 
between such uni-edges, we can calculate the expected number of 
read -pairs Expectedj)(e 1,62) for all pairs of uni-edges within the 
same long edge. If uni-edges ei and 62 come from the same long 
edge and Expectedj)(ei,e2) = 0, then all points in the confidence 
strip Strip r){e\,e2) represent chimeric read-pairs connecting e\ 
and €2. Such edge-pair (^1,^2) is classified as a false edge-pair. 
Otherwise, if ExpectedD(e i, e2)>0, we assume that all points 
within Stripj)(e\, e2) represent correct read-pairs and classify 
(^1,^2) as a true edge-pair. 

Edges ei and ^2 and a parameter D define the strip 
Stripo{e\,e2) with the number of points PointSn(e i,e2). For a 
pair of edges ei and ^2, we define 

D* = argmax Pointsniei, ^2). 

D 

When (^1,^2) is a false edge-pair, D defines a confidence strip S 
tripD*{e\, ^2) with the maximum number of chimeric read-pairs. 
To compute the threshold ^, we assume this worst-case scenario 
for all pairs of uni-edges (within the same long edge) by using 
distance D (rather than the known genomic distance) for calcu- 
lating the densities Density D*{e\, ^2)- 

For a certain value of ^ we define false positives (false nega- 
tives) as the false (true) edge-pairs that have density higher 
(lower) than ^' . Figures 8a and b illustrate how false-positive 
rate (green) and false -negative rate (blue) depend on the param- 
eter ^. exSPAnder selects ^ that corresponds to the intersection 
point of the false-positive and false-negative plots. Our bench- 
marking revealed that such choice of the parameter ^ allows one 
to filter out the rectangles containing only chimeric reads-pairs 
based on their densities. Additional analysis revealed that esti- 
mating the parameter ^ is an important step in the exSPAnder 
algorithm because varying this parameter may significantly affect 
the assembly quality (see Section 3 in Appendix). 




Fig. 8. Plots of the false-positive (green) and false-negative (blue) rates 
for (a) B.facium and (b) S. aureus paired-end libraries 



3.4.2 The decision rule Our analysis revealed that varying par- 
ameters C and 0 within specified ranges (see Section 3 of the 
Appendix) hardly affects the quality of the resulting assemblies. 
However, selecting inappropriate C and 0 may result in a dete- 
riorated performance of exSPAnder. Thus, we arbitrarily select 
these parameters within the ranges specified in Section 3 of 
Appendix (analysis of diverse sequencing datasets supports the 
default values C = 1.5 and 0 = 0.5). 

4 RESULTS 

4.1.1 Datasets We have compared exSPAnder (coupled with 
SPAdes assembler) with several popular assemblers on the 
B.faecium isolate dataset (genome size 3.6 Mb) and the 
S. aureus single-cell dataset (genome size 2.9 Mb). For each 
dataset we have generated assemblies of (i) only paired-end 
library and (ii) both paired-end and jumping libraries. Section 
1 of Appendix provides a detailed description of both datasets. 

4.1.2 Benchmarking assemblies ABySS 1.3.6, Ray 2.0.0, Velvet, 
Velvet-SC and SOAPdenovo 2.0.4 were run with ^-mer size 55. 
IDBA-UD 1.1.1 was run in its default iterative mode. The 
authors of (Peng et al., 2012) released this new version of 
IDBA-UD that is capable of using several paired-end libraries, 
but there is no manuscript yet covering this new development. 
ALLPATHS-LG was run with the default parameters; however, 
we down-sampled jumping library for the B.faecium dataset to 
generate lOOx coverage required by ALLPATHS-LG. SPAdes 
2.4 (previous version of SPAdes that did not include exSPAnder 
and did not support multiple libraries) was run in its default 
iterative mode with = 21, 33, 55, 77 for the B.faecium dataset 
and k = 2\, 33, 55 for the S. aureus dataset. exSPAnder (coupled 
with SPAdes) was run using the default parameters. 

To analyze the resulting assemblies we used QUAST 2.2 
(Gurevich et al., 2013) that reports various parameters including 
NG50 (similar to N50, but is calculated with respect to the 
reference genome size), the total number of contigs/scaffolds, 
the length of the longest assembled contig/scaffold, the number 
of misassemblies and the fraction of genome mapped. QUAST 
defines a misassembly breakpoint as a position in the contig/ 
scaffold, such that its left and right flanking sequences either 
align to the reference genome over 1 kb away from each other, 
or overlap by > 1 kb, or align on opposite strands or different 
chromosomes (Gurevich et al., 2013). To compare assemblers we 
used both contigs and scaffolds of length exceeding 500 bp. 
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Table 2. Comparison of contigs for the B.faecium isolate dataset 



Assembler 


NG50 


Number 


Largest 


Number 


GF 






of contigs 




of mis 




Only paired-end 


library 










ABySS 


203 


40 


672 


0 


99.9 


Ray 


114 


51 


436 


1 


98.9 


SOAPdenovo 


20 


333 


61 


0 


98.8 


Velvet 


144 


47 


550 


0 


99.4 


Velvet-SC 


163 


46 


550 


0 


99.4 


IDBA-UD 


202 


39 


483 


0 


99.4 


SPAdes 2.4 


361 


24 


635 


1 


99.7 


exSPAnder 


380 


22 


672 


1 


99.5 


Both paired-end and jumping 


libraries 








ABySS 


203 


40 


672 


0 


99.9 


ALLPATHS-LG 313 


21 


686 


0 


99.5 


Ray 


87 


88 


416 


2 


96.8 


SOAPdenovo 


20 


333 


61 


0 


98.8 


Velvet 


103 


75 


242 


11 


99.0 


Velvet-SC 


253 


40 


545 


15 


99.8 


IDBA-UD 


207 


41 


483 


0 


99.4 


exSPAnder 


3268 


2 


3268 


1 


99.9 



Note: NG50 is given in kb; number of contigs is the total number of contigs 
>500bp; largest stands for the length (in kb) of the longest contig assembled; 
number of mis is the number of misassembUes; GF stands for the fraction of 
genome mapped given in percent. In each column, the best value is indicated 
in bold. 



Tables 2 and 3 show the benchmarking results for the 
B.faecium isolate dataset. Interestingly, the single-cell assemblers 
(IDBA-UD and exSPAnder coupled with SPAdes) as well as 
ABySS performed well on the B.faecium isolate dataset and pro- 
duced contigs with the largest NG50 in the case of a single 
library. While AbySS generated the assembly with the maximal 
genome fraction, manual inspection revealed that it reflects the 
specifics of ABySS and QUAST reporting (mapping each repeat 
to a single position in the genome) rather than real superiority of 
ABySS by this metric. 

In the case of two libraries, exSPAnder produced the best 
contigs while ALLPATHS-LG produced the best scaffolds. 
The complexity of using jumping libraries is reflected in a dete- 
riorated performance of ABySS and Ray (reduction in NG50) as 
well as Velvet and Velvet-SC (dramatic increase in the number of 
misassemblies). 

Tables 4 and 5 compare various assemblers on the S.aureus 
single-cell dataset. This comparison highlights the complexity of 
both (i) assembling single-cell datasets and (ii) using jumping 
libraries. For example, SOAPdenovo produced assemblies of 
poor quality for single-cell data (we decided not to include it in 
Tables 4 and 5). Similarly, ABySS produced assemblies with high 
number of misassemblies for the single-cell data. Velvet and 
Velvet-SC are not included in the benchmark experiment for 
jumping libraries testing because they also produce low-quality 
assemblies when both paired-end and jumping libraries are used 
simultaneously. IDBA-UD performed well on a single paired- 
end library, but produced an assembly of lower quality when 
both libraries were provided (decreased NG50). exSPAnder 



Table 3. Comparison of scaffolds for the Bfaecium isolate dataset 



Assembler NG50 Number of Largest Number GF 

scaffolds of mis 



Only paired-end hbrary 



ABySS 


383 


24 


676 


0 


99.9 


Ray 


204 


31 


553 


1 


98.9 


SOAPdenovo 


477 


26 


724 


0 


99.3 


Velvet 


477 


28 


724 


0 


99.4 


Velvet-SC 


477 


28 


724 


0 


99.4 


IDBA-UD 


250 


30 


671 


0 


99.4 


SPAdes 2.4 


361 


22 


671 


1 


99.7 


exSPAnder 


380 


22 


672 


1 


99.5 


oth paired-end and jumping 


; libraries 








ABySS 


250 


30 


739 


1 


99.9 


ALLPATHS-LG 


3610 


7 


3610 


1 


99.5 


Ray 


106 


75 


416 


2 


96.8 


SOAPdenovo 


480 


28 


810 


2 


99.4 


Velvet 


2651 


14 


2651 


78 


99.1 


Velvet-SC 


945 


102 


1381 


500 


98.9 


IDBA-UD 


1002 


9 


1692 


0 


99.4 


exSPAnder 


3268 


2 


3268 


1 


99.9 



Note: NG50 is given in kb; number of scaffolds is the total number of scaffolds 
>500bp; largest stands for the length (in kb) of the longest scaffold assembled; 
number of mis is the number of misassemblies; GF stands for the fraction of 
genome mapped given in percent. In each column, the best value is indicated in 
bold. 



Table 4. Comparison of contigs for the S.aureus single-cell dataset 



Assembler NG50 Number of Largest Number GF 

contigs of mis 



Only paired-end Hbrary 



ABySS 


27 


914 


91 


262 


98.0 


Ray 


21 


306 


108 


14 


88.7 


Velvet 


10 


538 


56 


2 


93.2 


Velvet-SC 


9 


616 


56 


4 


94.2 


IDBA-UD 


75 


390 


161 


7 


98.6 


SPAdes 2.4 


98 


400 


230 


8 


99.1 


exSPAnder 


148 


366 


275 


3 


98.6 


oth paired-end and 


jumping 


; libraries 








ABySS 


27 


914 


91 


262 


98.0 


ALLPATHS-LG 


15 


283 


75 


26 


79.9 


Ray 


100 


178 


486 


21 


93.5 


IDBA-UD 


47 


415 


161 


7 


98.6 


exSPAnder 


314 


322 


603 


9 


99.3 



Note: NG50 is given in kb; number of contigs is the total number of contigs >500 bp; 
largest stands for the length (in kb) of the longest contig assembled; number of mis is 
the number of misassemblies; GF stands for the fraction of genome mapped given in 
percent. In each column, the best value is indicated in bold. 



produced assemblies with the highest NG50 and largest 
assembled contig/scaffold. 

Using only paired-end library IDBA-UD, SPAdes 2.4 and 
exSPAnder recovered the largest fraction of the genome 
(>98.5%). However, the highest genome fraction of the assembly 
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Table 5. Comparison of scaffolds for the S.aureus single-cell dataset 



Assembler 


NG50 


Number of 
scaffolds 


Largest 


Number 
of mis 


OF 


Only paired-end library 










ABySS 


28 


910 


91 


270 


98.2 


Ray 


21 


306 


108 


14 


88.7 


Velvet 


10 


538 


56 


2 


93.2 


Velvet-SC 


10 


620 


56 


5 


94.2 


IDBA-UD 


88 


382 


161 


8 


98.2 


SPAdes 2.4 


99 


391 


230 


8 


99.2 


exSPAnder 


148 


357 


426 


4 


98.6 


Both paired-end and jumping libraries 








ABySS 


30 


852 


91 


275 


98.0 


ALLPATHS-LG 


40 


165 


132 


69 


79.9 


Ray 


100 


169 


486 


25 


93.5 


IDBA-UD 


55 


397 


161 


9 


98.6 


exSPAnder 


314 


302 


603 


9 


99.3 



Note: NG50 is given in kb; number of scaffolds is the total number of scaffolds 
>500bp; largest stands for the length (in kb) of the longest scaffold assembled; 
number of mis is the number of misassemblies; GF stands for the fraction of 
genome mapped given in percent. In each column, the best value is indicated in 
bold. 



generated by SPAdes 2.4 reflects the specifics of SPAdes 2.4 and 
QUAST reporting (some artifacts with reporting of repetitive 
regions) rather than real advantage of SPAdes 2.4 with respect 
to this parameter. 

When using both libraries simultaneously, exSPAnder pro- 
duced assemblies with the highest genome fraction exceeding 
99%, the largest genome fraction we saw across dozens of 
single-cell datasets assembled with SPAdes in the past 2 years. 
Moreover, Tables 4 and 5 show that exSPAnder successfully 
deals with the high rate of the chimeric read-pairs and relatively 
high variations in the insert size. 



5 CONCLUSION 

We have presented exSPAnder algorithm for resolving 
repeats using either a single or multiple read-pair libraries with 
different insert sizes, which is applicable for both single-cell and 
isolate bacterial datasets. Benchmarks across eight popular 
assemblers demonstrate that exSPAnder produces high-quality 
assemblies for datasets of different types. Additionally, as 
illustrated by recent integration of lUumina and PacBio reads 
in SPAdes 3.0, exSPAnder is a flexible approach that can be 
easily modified to work with diverse types of sequencing data. 
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